Small tweaks to the devdocs #72
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##           master      #72   +/-   ##
=======================================
  Coverage   92.91%   92.91%
=======================================
  Files          26       26
  Lines        2836     2836
=======================================
  Hits         2635     2635
  Misses        201      201
```

Continue to review full report at Codecov.
I updated the Manifest file to Documenter 0.24.5, and added a little more to the devdocs. You can also estimate reciprocal throughput and latency, e.g.:

```julia
using CpuId, VectorizationBase, SIMDPirates, SLEEFPirates, VectorizedRNG

@generated function estimate_cost_onearg(f::F, N::Int = 512, K = 1_000, ::Type{T} = Float64, ::Val{U} = Val(4)) where {F,T,U}
    W, Wshift = VectorizationBase.pick_vector_width_shift(T)
    quote
        Base.Cartesian.@nexprs $U u -> s_u = vbroadcast(Vec{$W,$T}, zero(T))
        # s = vbroadcast(V, zero(T))
        x = rand(T, N << $Wshift)
        ptrx = pointer(x)
        ts_start, id_start = cpucycle_id()
        for k ∈ 1:K
            _ptrx = ptrx
            for n ∈ 1:N >> $(VectorizationBase.intlog2(U))
                Base.Cartesian.@nexprs $U u -> begin
                    v_u = vload(Vec{$W,$T}, _ptrx)
                    s_u = vadd(s_u, f(v_u))
                    _ptrx += VectorizationBase.REGISTER_SIZE
                end
            end
        end
        ts_end, id_end = cpucycle_id()
        @assert id_start == id_end
        Base.Cartesian.@nexprs $(U-1) u -> s_1 = vadd(s_1, s_{u+1})
        (ts_end - ts_start) / (N*K), vsum(s_1)
    end
end
```

I'm sure this could be improved. It adds an extra add and load instruction, but my concern was that LLVM may optimize things away.

```julia
julia> first(estimate_cost_onearg(SLEEFPirates.log, 512, 10_000, Float64, Val(1))) # 51 cycles # 44
13.4911943359375

julia> first(estimate_cost_onearg(SLEEFPirates.log, 512, 10_000, Float64, Val(2))) # 51 cycles # 40
13.177637109375

julia> first(estimate_cost_onearg(SLEEFPirates.log, 512, 10_000, Float64, Val(4))) # 51 cycles # 39
13.1288251953125

julia> first(estimate_cost_onearg(SLEEFPirates.exp, 512, 10_000, Float64, Val(1))) # 51 cycles # 44
14.2456966796875

julia> first(estimate_cost_onearg(SLEEFPirates.exp, 512, 10_000, Float64, Val(2))) # 51 cycles # 40
14.753721484375

julia> first(estimate_cost_onearg(SLEEFPirates.exp, 512, 10_000, Float64, Val(4))) # 51 cycles # 39
13.128287109375
```

Let me know if you think it's ready and I'll merge it.
Most of them came from there (specifically, Skylake-X), so I've now mentioned that. I used the approach with
The unit is "number of floating-point registers". AVX-512 systems have 32, and other x86-64 CPUs have 16. It is set to 0 for most instructions, under the heuristic assumption that they won't net-consume any extra registers. I think that is unlikely to be wrong in practice; generally at least one of the arguments won't be used anymore, so the register it occupied becomes available again. The primary exceptions are
The register pressure comes into play when solving for tile size. That is, if it is considering unrolling 2 loops, it solves the constrained optimization problem of minimizing cost without consuming any more registers than the CPU cores have available.
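To make the register-pressure constraint concrete, here is a toy brute-force version of that constrained optimization. Everything here is illustrative: the cost model and the pressure formula are made up for the sketch, and this is not LoopVectorization's actual solver.

```julia
# Toy register-constrained tile-size search (illustrative only; not
# LoopVectorization's actual solver). Assumptions: a (u1, u2) tile keeps
# u1*u2 accumulators plus u1 + u2 loaded vectors live, and a larger tile
# amortizes fixed loop overhead over more iterations.
function best_tile(nregisters::Int; maxu = 8)
    best = (1, 1)
    bestcost = Inf
    for u1 in 1:maxu, u2 in 1:maxu
        # register pressure: accumulators + operands must fit
        u1 * u2 + u1 + u2 <= nregisters || continue
        # toy cost: one fma per iteration, plus fixed overhead
        # split across the u1*u2 iterations of the tile
        cost = 1.0 + 4.0 / (u1 * u2)
        if cost < bestcost
            bestcost = cost
            best = (u1, u2)
        end
    end
    best
end

best_tile(16)  # 16 registers, as on AVX2
best_tile(32)  # 32 registers, as on AVX-512, permit a larger tile
```

The point of the sketch is only that the feasible tile grows with the register count, so AVX-512's 32 registers admit more aggressive unrolling than the 16 available elsewhere.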
Looking at the instruction tables you linked, on page 271 (Skylake-X), reciprocal throughput: divsd: 4, sqrtsd: 4-6.

Most instructions fell into one of two categories: either their cost was independent of the length of the vectors*, or it increased in direct proportion to vector length. I could probably represent this with bigger tables, but for now it's mostly using sentinel values to indicate how the instruction cost will change as a function of vector width.

*If you weren't running any vectorized instructions, clock frequency could increase, but

This is more about which you want to hoist out of the loop: the squaring, or the inversion. Given fast-math flags, LLVM will choose "neither" (it'll replace the single inversion followed by a multiplication on each iteration with repeated divisions).

```julia
julia> using LoopVectorization, BenchmarkTools

julia> function contrived_example1(x, y)
           s = zero(promote_type(eltype(x), eltype(y)))
           @inbounds for i in eachindex(x); @simd for j in eachindex(y)
               s += inv(x[i]) * abs2(y[j])
           end; end
           s
       end
contrived_example1 (generic function with 1 method)

julia> function contrived_example2(x, y)
           s = zero(promote_type(eltype(x), eltype(y)))
           @inbounds for j in eachindex(y); @simd for i in eachindex(x)
               s += inv(x[i]) * abs2(y[j])
           end; end
           s
       end
contrived_example2 (generic function with 1 method)

julia> function contrived_example_avx1(x, y)
           s = zero(promote_type(eltype(x), eltype(y)))
           @avx for i in eachindex(x), j in eachindex(y)
               s += inv(x[i]) * abs2(y[j])
           end
           s
       end
contrived_example_avx1 (generic function with 1 method)

julia> function contrived_example_avx2(x, y)
           s = zero(promote_type(eltype(x), eltype(y)))
           @avx for j in eachindex(y), i in eachindex(x)
               s += inv(x[i]) * abs2(y[j])
           end
           s
       end
contrived_example_avx2 (generic function with 1 method)

julia> x = rand(200); y = rand(200);

julia> @btime contrived_example1($x, $y)
  4.563 μs (0 allocations: 0 bytes)
101147.92090855418

julia> @btime contrived_example2($x, $y)
  20.870 μs (0 allocations: 0 bytes)
101147.9209085543

julia> @btime contrived_example_avx1($x, $y)
  2.425 μs (0 allocations: 0 bytes)
101147.92090855431

julia> @btime contrived_example_avx2($x, $y)
  2.534 μs (0 allocations: 0 bytes)
101147.92090855431
```

It would be nice if I could rely on LLVM for some of these optimization decisions, but it seems easier to find cases where fast-math flags cause regressions than where they help, once I've already searched the expression to substitute
I approve your changes. Both the changes to the docs and your replies are very informative as always! Really enjoying learning more about CPUs from an obvious master!
It occurs to me that one option for the future might be a build step in which we measure the costs on the specific machine on which this is being built.
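Such a build step might look roughly like the following. This is only a sketch of the idea: the timing helper, the `measured_costs.jl` file name, and the operations timed are all invented here, not taken from the package.

```julia
# Hypothetical build-time cost measurement (sketch only; the helper and
# file name are invented for illustration). Times a few scalar kernels
# on the host machine so the estimates could be persisted for later use.
function measure_cost(f; evals = 10_000)
    xs = 1.0 .+ rand(evals) .* 1e-3   # varied inputs so the call isn't hoisted
    f(xs[1])                          # warm up / force compilation
    t = time_ns()
    s = 0.0
    @inbounds for x in xs
        s += f(x)
    end
    elapsed = (time_ns() - t) / evals
    isnan(s) && error("unreachable")  # use `s` so the loop isn't elided
    elapsed                           # rough ns per call, incl. loop overhead
end

costs = Dict(
    "exp"  => measure_cost(exp),
    "log"  => measure_cost(log),
    "sqrt" => measure_cost(sqrt),
)

# A real build step could then write these next to the package, e.g.:
# open(joinpath(@__DIR__, "measured_costs.jl"), "w") do io
#     println(io, "const MEASURED_COSTS = ", repr(costs))
# end
```

A cycle-accurate version would presumably reuse the `cpucycle_id`-based approach from the comment above rather than wall-clock timing.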
Thanks -- I'm flattered. You're obviously a Julia master, with an impressive array of popular packages used by many, if not most, in the community.
I think this could be done fairly quickly, so it'd be pretty reasonable.
The new devdocs are wonderful! They are really helping me get a better handle on the internals.
These are a few tweaks that helped clarify my thinking, but different folks might think differently. Feel free to close this if you think it's a step backwards.
The change to the `prettyurls` setting makes it easier to browse the docs in a local build. I also added `[compat]` bounds on Documenter because I've had my docs break when new releases are made.

One thing I was intrigued by is where the costs come from. I tried to put in one reference, but if you prefer another one, by all means use it instead. And I'm pretty vague on what the register pressure is actually measuring (what are its units?), and didn't even try on the `scaling` parameter.